47 research outputs found
TensorFlow on state-of-the-art HPC clusters: a machine learning use case
The recent rapid growth of the data-flow programming paradigm has enabled the development of specialized architectures,
e.g., for machine learning. The best-known example is the Tensor Processing Unit (TPU) by Google. Standard data centers, however, still cannot dedicate large partitions to machine-learning-specific architectures. Within data centers, High-Performance Computing (HPC) clusters are highly parallel machines targeting a broad class of compute-intensive workloads, and as such they can be used to tackle machine learning challenges. On top of this, HPC architectures are changing rapidly, incorporating accelerators and instruction sets beyond the classical x86 CPUs. In this unsettled landscape, identifying the hardware/software configurations that best support machine
learning workloads on HPC clusters is not trivial. In this paper, we consider the TensorFlow workflow for image recognition. We highlight the strong dependency of training-phase performance on the availability of arithmetic libraries optimized for the underlying architecture. Following the example of Intel leveraging the MKL libraries to improve TensorFlow performance, we plugged the Arm Performance Libraries into TensorFlow and tested it on an HPC cluster based on Marvell ThunderX2 CPUs. We also performed a scalability study on three state-of-the-art HPC clusters based on different CPU architectures: x86 Intel Skylake, Arm-v8 Marvell ThunderX2, and PowerPC IBM Power9.
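The dependency of dense-arithmetic performance on the linked library can be observed at small scale with NumPy, which, like TensorFlow, delegates its matrix products to whichever BLAS it was built against (MKL, Arm Performance Libraries, OpenBLAS, ...). A minimal probe, where the matrix size and repeat count are arbitrary illustrative choices rather than the paper's benchmark:

```python
import time
import numpy as np

def gemm_gflops(n=1024, repeats=3):
    """Time an n x n matrix multiply and report sustained GFLOP/s.

    A GEMM performs roughly 2*n^3 floating-point operations, so the
    achieved rate directly reflects the quality of the linked BLAS.
    """
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b  # warm-up: trigger any lazy initialization inside the BLAS
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

if __name__ == "__main__":
    np.show_config()  # reveals which BLAS/LAPACK NumPy was built against
    print(f"sustained GEMM rate: {gemm_gflops():.1f} GFLOP/s")
```

Running this with the same NumPy version linked against different BLAS backends typically shows order-of-magnitude rate differences, mirroring the training-phase sensitivity reported above.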
MPI+X: task-based parallelization and dynamic load balance of finite element assembly
The main computing tasks of a finite element (FE) code for solving partial
differential equations (PDEs) are the algebraic system assembly and the
iterative solver. This work focuses on the first task, in the context of a
hybrid MPI+X paradigm. Although we describe the algorithms in the FE context,
a similar strategy can be applied straightforwardly to other discretization
methods, such as the finite volume method. The matrix assembly consists of a
loop over the elements of the MPI partition that computes the element
matrices and right-hand sides and assembles them into the system local to
each MPI partition. In an MPI+X hybrid parallelism context, X has
traditionally consisted of loop parallelism using OpenMP. Several strategies
have been proposed in the literature to implement this loop parallelism, such
as coloring or substructuring techniques, to circumvent the race condition
that appears when assembling the element system into the local system. The
main drawback of the first technique is a decrease in IPC due to poor spatial
locality. The second technique avoids this issue but requires extensive
changes in the implementation, which can be cumbersome when several element
loops must be treated. We propose an alternative based on task parallelism of
the element loop, using some extensions to the OpenMP programming model. The
taskification of the assembly solves both aforementioned problems. In
addition, dynamic load balance is applied using the DLB library, which is
especially efficient in the presence of hybrid meshes, where the relative
costs of the different elements are impossible to estimate a priori. This
paper presents the proposed methodology, its implementation, and its
validation through the solution of large computational mechanics problems on
up to 16k cores.
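The coloring strategy contrasted above can be sketched as follows: two elements that share a mesh node receive different colors, so each per-color sweep is race-free, but the sweep visits non-contiguous entries of the local system, which is exactly the spatial-locality drawback noted. The tiny mesh and unit element contributions are made up for illustration:

```python
import numpy as np

def color_elements(elems):
    """Greedy coloring: two elements sharing a mesh node get different
    colors, so all elements of one color can be assembled concurrently
    without races on the shared local system."""
    node_to_elems = {}
    for e, nodes in enumerate(elems):
        for n in nodes:
            node_to_elems.setdefault(n, []).append(e)
    colors = [-1] * len(elems)  # -1 means not yet colored
    for e, nodes in enumerate(elems):
        used = {colors[ne] for n in nodes for ne in node_to_elems[n]}
        c = 0
        while c in used:
            c += 1
        colors[e] = c
    return colors

def assemble_rhs(elems, elem_rhs, n_nodes):
    """Assemble element right-hand sides color by color; each inner loop
    could be an OpenMP-style parallel loop with no atomics."""
    colors = color_elements(elems)
    rhs = np.zeros(n_nodes)
    for c in range(max(colors) + 1):
        for e in (i for i, ci in enumerate(colors) if ci == c):
            rhs[list(elems[e])] += elem_rhs[e]
    return rhs
```

On a 1D mesh of two-node elements such as `[(0, 1), (1, 2), (2, 3)]`, adjacent elements land in different colors, and the assembled vector matches the serial result.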
Computational Fluid and Particle Dynamics Simulations for Respiratory System: Runtime Optimization on an Arm Cluster
Computational fluid and particle dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes call for high-performance computing (HPC) resources. For these reasons, we introduce and evaluate in this paper system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on both Intel- and Arm-based HPC clusters, showing the importance of mechanisms applied at runtime to improve performance independently of the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x while keeping the computational resources constant. This work is partially supported by the Spanish
Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat
de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697).
Dynamic load balancing for hybrid applications
DLB relies on the use of hybrid programming models and exploits the malleability of the second level of parallelism to redistribute computing power across processes.
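A toy sketch of the underlying idea (deliberately not the DLB API, whose calls are not reproduced here): after measuring each process's load, cores are reassigned in proportion to that load for the next iteration, conserving the total. The helper name `redistribute_cores` is hypothetical, and the sketch assumes at least one core per process with `total_cores >= len(loads)`:

```python
def redistribute_cores(loads, total_cores):
    """Assign each process a core share proportional to its measured
    load, guaranteeing at least one core per process and conserving the
    total number of cores."""
    total_load = sum(loads)
    raw = [l / total_load * total_cores for l in loads]
    cores = [max(1, int(r)) for r in raw]
    # Fix integer rounding so the total is conserved: hand out missing
    # cores to the most under-served processes, take back surplus cores
    # from the most over-served ones (never below one core).
    while sum(cores) < total_cores:
        i = max(range(len(cores)), key=lambda j: raw[j] - cores[j])
        cores[i] += 1
    while sum(cores) > total_cores:
        i = max(range(len(cores)),
                key=lambda j: cores[j] - raw[j] if cores[j] > 1 else float("-inf"))
        cores[i] -= 1
    return cores
```

For example, three processes with loads `[1, 1, 2]` on 8 cores would receive `[2, 2, 4]`; the heavily loaded process gets twice the computing power for the next step.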
Lessons learned from a performance analysis and optimization of a multiscale cellular simulation
This work presents a comprehensive performance analysis and optimization of a
multiscale agent-based cellular simulation. The optimizations applied are
guided by detailed performance analysis and include memory management, load
balance, and a locality-aware parallelization. The outcome of this paper is
not only the 2.4x speedup achieved by the optimized version with respect to
the original PhysiCell code, but also the lessons learned and best practices
for developing parallel HPC codes that are efficient and highly performant,
especially in the computational biology field.
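A minimal sketch of what a locality-aware reordering can look like (an illustrative assumption, not PhysiCell's actual implementation): agents are binned into spatial cells and reordered so that agents in the same cell sit next to each other in memory, so neighbor interactions touch contiguous data. The 2D grid and the helper name `reorder_by_cell` are made up for the example:

```python
import numpy as np

def reorder_by_cell(positions, box_size, cell_size):
    """Sort agents so that agents in the same spatial cell are
    contiguous in memory, improving cache locality for per-cell work.

    positions: (n, 2) array of agent coordinates in [0, box_size).
    Returns the reordered positions and the sorted cell index per agent.
    """
    ncell = int(np.ceil(box_size / cell_size))
    ij = np.floor(positions / cell_size).astype(int).clip(0, ncell - 1)
    cell_id = ij[:, 0] * ncell + ij[:, 1]   # flatten 2D cell to one index
    order = np.argsort(cell_id, kind="stable")
    return positions[order], cell_id[order]
```

In an agent-based code this reordering would be repeated periodically, since agents migrate between cells as the simulation advances.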
Runtime Mechanisms to Survive New HPC Architectures: A Use-Case in Human Respiratory Simulations
Computational Fluid and Particle Dynamics (CFPD) simulations are of paramount importance for studying and improving drug effectiveness. The computational requirements of CFPD codes demand high-performance computing (HPC) resources. For these reasons, we introduce and evaluate in this paper system software techniques for improving performance and tolerating load imbalance in a state-of-the-art production CFPD code. We demonstrate the benefits of these techniques on Intel-, IBM-, and Arm-based HPC technologies ranked in the Top500 list of supercomputers, showing the importance of mechanisms applied at runtime to improve performance independently of
the underlying architecture. We run a real CFPD simulation of particle tracking in the human respiratory system, showing performance improvements of up to 2x across different architectures when applying the runtime techniques, while keeping the computational resources constant. This work is partially supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Technology project (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European Mont-Blanc projects (288777, 610402 and 671697).
Leveraging HPC Profiling & Tracing Tools to Understand the Performance of Particle-in-Cell Monte Carlo Simulations
Large-scale plasma simulations are critical for designing and developing
next-generation fusion energy devices and for modeling industrial plasmas.
BIT1 is a massively parallel Particle-in-Cell code designed specifically for
studying plasma-material interaction in fusion devices. Its most salient
characteristic is the inclusion of collision Monte Carlo models for different
plasma species. In this work, we characterize the single-node, multi-node,
and I/O performance of the BIT1 code in two realistic cases using several HPC
profiling tools: perf, IPM, Extrae/Paraver, and Darshan. We find that the
on-node performance of the BIT1 sorting function is the main performance
bottleneck. Strong scaling tests show a parallel performance of 77% and 96%
on 2,560 MPI ranks for the two test cases. We demonstrate that communication,
load imbalance, and self-synchronization are important factors impacting the
performance of BIT1 in large-scale runs. Accepted at the Euro-Par 2023
workshops (TDLPP 2023).
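Sorting particles by cell index is a classic PIC hot spot. A common remedy, shown here as an illustrative sketch rather than BIT1's actual routine, is a counting sort: it runs in O(n) instead of O(n log n), and it yields the per-cell offsets that per-cell Monte Carlo collision kernels need as a free by-product:

```python
import numpy as np

def sort_particles(cell_idx, n_cells):
    """Counting sort of particles by cell index.

    Returns `order`, a permutation placing particles of the same cell
    contiguously, and `offsets`, where particles of cell c occupy the
    slice order[offsets[c]:offsets[c + 1]].
    """
    counts = np.bincount(cell_idx, minlength=n_cells)
    offsets = np.concatenate(([0], np.cumsum(counts)))
    order = np.empty(len(cell_idx), dtype=int)
    cursor = offsets[:-1].copy()       # next free slot of each cell
    for p, c in enumerate(cell_idx):   # single O(n) scatter pass
        order[cursor[c]] = p
        cursor[c] += 1
    return order, offsets
```

In a production code this pass would be applied to the particle arrays themselves (positions, velocities, weights) so that each cell's particles are contiguous before the collision step.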
Bronchial Aspirate-Based Profiling Identifies MicroRNA Signatures Associated With COVID-19 and Fatal Disease in Critically Ill Patients
Background: The pathophysiology of COVID-19-related critical illness is not completely understood. Here, we analyzed the microRNA (miRNA) profile of bronchial aspirate (BAS) samples from COVID-19 and non-COVID-19 patients admitted to the ICU to identify prognostic biomarkers of fatal outcomes and to define molecular pathways involved in the disease and adverse events.
Methods: Two patient populations were included (n = 89): (i) a study population composed of critically ill COVID-19 and non-COVID-19 patients; (ii) a prospective study cohort composed of COVID-19 survivors and non-survivors among patients assisted by invasive mechanical ventilation (IMV). BAS samples were obtained by bronchoaspiration during the ICU stay. The miRNA profile was analyzed using RT-qPCR. Detailed biomarker and bioinformatics analyses were performed.
Results: Deregulation of five miRNA ratios (miR-122-5p/miR-199a-5p, miR-125a-5p/miR-133a-3p, miR-155-5p/miR-486-5p, miR-214-3p/miR-222-3p, and miR-221-3p/miR-27a-3p) was observed when COVID-19 and non-COVID-19 patients were compared. In addition, five miRNA ratios segregated ICU survivors from non-survivors (miR-1-3p/miR-124-3p, miR-125b-5p/miR-34a-5p, miR-126-3p/miR-16-5p, miR-199a-5p/miR-9-5p, and miR-221-3p/miR-491-5p). Through multivariable analysis, we constructed a miRNA ratio-based prediction model for ICU mortality that selected the best combination of miRNA ratios (miR-125b-5p/miR-34a-5p, miR-199a-5p/miR-9-5p, and miR-221-3p/miR-491-5p). The model (AUC 0.85) and the miR-199a-5p/miR-9-5p ratio alone (AUC 0.80) showed optimal discrimination and outperformed the best clinical predictor of ICU mortality (days from first symptoms to IMV initiation, AUC 0.73). The survival analysis confirmed the usefulness of the miRNA ratio model and of the individual ratio for identifying patients at high risk of fatal outcomes following IMV initiation. Functional enrichment analyses identified pathological mechanisms implicated in fibrosis, coagulation, viral infections, immune responses, and inflammation.
Conclusions: COVID-19 induces a specific miRNA signature in BAS from critically ill patients. In addition, specific miRNA ratios in BAS samples hold individual and collective potential to improve risk-based patient stratification following IMV initiation in COVID-19-related critical illness. The biological role of the host miRNA profiles may allow a better understanding of the different pathological axes of the disease. We particularly want to acknowledge the patients, the Biobank IdISBa, and the CIBERES Pulmonary Biobank Consortium (PT17/0015/0001), a member of the Spanish National Biobanks Network financed by the Carlos III Health Institute, with the participation of the Units of Intensive Care, Clinical Analysis and Pulmonology of Hospital Universitario Son Espases and Hospital Son Llatzer, for their collaboration. This work was also supported by the IRBLleida Biobank (B.0000682) and Plataforma Biobancos PT17/0015/0027. Article signed by 25 authors: Marta Molinero, Iván D. Benítez, Jessica González, Clara Gort-Paniello, Anna Moncusí-Moix, Fátima Rodríguez-Jara, María C. García-Hidalgo, Gerard Torres, J. J. Vengoechea, Silvia Gómez, Ramón Cabo, Jesús Caballero, Jesús F. Bermejo-Martin, Adrián Ceccato, Laia Fernández-Barat, Ricard Ferrer, Dario Garcia-Gasulla, Rosario Menéndez, Ana Motos, Oscar Peñuelas, Jordi Riera, Antoni Torres, Ferran Barbé, and David de Gonzalo-Calvo on behalf of the CIBERESUCICOVID Project (COV20/00110 ISCIII).
Optimization of condensed matter physics application with OpenMP tasking model
The Density Matrix Renormalization Group (DMRG++) is a condensed matter physics application used to study the superconductivity properties of materials. Its main computation consists of building the Hamiltonian matrix, which requires sparse matrix-vector multiplications. This paper presents task-based parallelization and optimization strategies for the Hamiltonian algorithm. The algorithm is implemented as a mini-application in C++ and parallelized with OpenMP. The optimization leverages tasking features, such as dependencies and priorities, included in the OpenMP 4.5 standard. The code refactoring targets performance as much as programmability. The optimized version achieves a speedup of 8.0× with 8 threads and 20.5× with 40 threads on a Power9 compute node while reducing memory consumption by 90 MB with respect to the original code, all by adding fewer than ten OpenMP directives. This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV2015-0493), by the Spanish Ministry of Science and Technology (project TIN2015-65316-P), by the Generalitat de Catalunya (contract 2017-SGR-1414), and by the BSC-IBM Deep Learning Research Agreement, under JSA “Application porting, analysis and optimization for POWER and POWER AI”. This work was also partially supported by the Scientific Discovery through Advanced Computing (SciDAC) program funded by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research and Basic Energy Sciences, Division of Materials Sciences and Engineering. This research used resources of the Oak Ridge Leadership Computing Facility at Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.
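A sparse matrix-vector product naturally decomposes into independent row-block tasks, which is the shape of the OpenMP taskification described above. The sketch below mimics that decomposition in Python, with a thread pool standing in for OpenMP tasks; the CSR layout and names are illustrative, and because of the Python GIL this shows the decomposition rather than an actual speedup:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def csr_spmv_tasks(indptr, indices, data, x, n_tasks=4):
    """Sparse matrix-vector product y = A @ x, with the row range split
    into independent tasks: each task owns a disjoint block of rows, so
    no synchronization is needed on y (the analogue of an OpenMP
    taskloop over row blocks)."""
    n = len(indptr) - 1
    y = np.zeros(n)

    def block(lo, hi):
        # Classic CSR row loop, restricted to this task's rows.
        for i in range(lo, hi):
            s = 0.0
            for k in range(indptr[i], indptr[i + 1]):
                s += data[k] * x[indices[k]]
            y[i] = s

    bounds = np.linspace(0, n, n_tasks + 1).astype(int)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        for lo, hi in zip(bounds[:-1], bounds[1:]):
            pool.submit(block, lo, hi)  # one task per row block
    return y
```

In the OpenMP version, dependencies and priorities would additionally let the runtime overlap independent Hamiltonian blocks instead of processing them in a fixed order.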